perf(blitter): +15% AvP gameplay accurate via inlining ADDARRAY/DATA/COMP_CTRL#129
Merged
perf(blitter): +15% AvP gameplay accurate via inlining ADDARRAY/DATA/COMP_CTRL#129
Conversation
Generalizes the ad-hoc BLITTER_PROFILE pattern into a reusable,
zero-overhead-when-off counter system any subsystem can use.
* src/core/perf_counters.{h,c} - PERF_COUNTER / PERF_INC / PERF_ADD
macros backed by a constructor-registered linked list. When
BENCH_PROFILE is undefined every macro expands to (0) so there is
no runtime, code-size, or symbol cost in shipped builds.
* src/tom/blitter.c - migrate the existing BLITTER_PROFILE counters
in BlitterMidsummer2 onto the new system. Counters are embedded in
existing initializers via the comma operator so the file stays
C89-clean (no statements before declarations).
* Makefile - `make BENCH_PROFILE=1` defines the macro globally.
`make benchmark BENCH_PROFILE=1` re-invokes with TEST_EXPORTS=1 so
test_benchmark can dlsym `perf_counters_dump` and print all
registered counters next to the FPS report.
* test/tools/test_benchmark.c - dlsym the optional dump symbol; if
present (BENCH_PROFILE build), call it before the BENCHMARK RESULTS
block. No effect on default builds.
* exports-test.list / link-test.T - add perf_counters_{dump,reset,
register} so harnesses can reach them under TEST_EXPORTS=1.
* scripts/profile-mac.sh - one-line wrapper around `xctrace record`.
Defaults to Time Profiler; --template "CPU Counters" for PMU
events on Apple Silicon. Builds the core + harness, runs the
benchmark under instrumentation, writes a .trace bundle, and can
auto-open it with --open.
* docs/profiling.md - new sections covering BENCH_PROFILE counters
(when to use vs sampling profilers) and the profile-mac.sh wrapper.
Validated end-to-end with `make benchmark BENCH_PROFILE=1
BENCH_BLITTER=accurate`: counters populate
(blitter_inner=283994 over 120 frames of yarc), and default builds
remain unchanged.
Co-Authored-By: Claude Opus 4.7 <[email protected]>
Two small follow-ups to the perf_counters baseline:
* Move the perf_counters_register/dump/reset declarations out of
the BENCH_PROFILE ifdef in the header, and provide always-defined
no-op bodies in perf_counters.c. This lets the test ABI export
the symbols unconditionally (added back to exports-test.list /
link-test.T) so test_benchmark can dlsym them in any build flavor;
in non-BENCH_PROFILE builds the bodies are empty.
* Makefile benchmark target gains a BENCH_STATE knob:
make benchmark BENCH_ROM=foo.j64 BENCH_STATE=foo.state6
Plumbed through to test_benchmark --load-state, which already
supports both raw retro_serialize payloads and RetroArch RASTATE
containers (lands via PR #128). Lets us profile actual gameplay
scenes instead of boot/menu idle.
Co-Authored-By: Claude Opus 4.7 <[email protected]>
Profile data on AvP gameplay (state6, accurate blitter) showed
ADDARRAY as the single largest leaf in the entire emulator at
1910 sample-of-stack hits, with DATA (759) and COMP_CTRL (318)
not far behind. All four are called from the BlitterMidsummer2
inner loop only, and most call sites pass compile-time-constant
flags for daddasel/daddbsel/daddmode/sat/eightbit/hicinh/etc --
ideal candidates for per-call-site specialisation through the
compiler if the bodies become visible at the call site.
This commit moves the four definitions above BlitterMidsummer2
(in the order ADD16SAT -> ADDARRAY -> COMP_CTRL -> DATA so each
sees its dependencies) and marks them
`static INLINE __attribute__((always_inline))`. No body changes;
this is purely a re-arrangement so the compiler can do dead-arm
elimination and constant propagation across the call boundary.
Removed the matching extern forward declarations now that the
definitions provide the prototype.
Measured (Apple M-series, headless `make benchmark` against the
private AvP ROM with state6 loaded, accurate blitter, 600 frames
after 60 warmup, 3-run median):
BlitterMidsummer2 + callees, sample-of-stack
before: ~5268 (BM2 2281, ADDARRAY 1910, DATA 759, COMP_CTRL 318)
after: ~4592 (BM2 absorbs the four inlinees)
AvP accurate FPS
baseline: 173-176
+ADDARRAY: 192-195
+DATA+COMP: 198-201 (~+15% net)
Fast-blitter perf unchanged (within ~3% run-to-run noise).
test_blitter_compare and the rest of `make test` pass.
Bit-exactness preserved: the function bodies are byte-for-byte
identical to the originals, only their linkage and source-file
position changed.
Addresses real-world AvP-on-RetroArch slowdown / audio-dropout
report on Apple Silicon, where the extra ~25 FPS recovers enough
budget for presentation + audio mixing to fit in 16.6 ms.
Co-Authored-By: Claude Opus 4.7 <[email protected]>
The +15% inlining win shipped in the previous commit didn't fix the real-world AvP audio stutter, because audio dropouts are governed by *worst-case* frame time, not average. This commit lands the diagnostic infrastructure that found the actual cost driver. * test_benchmark now records per-frame timing and reports p50 / p99 / p999 / max ms-per-frame plus the count of frames that blew the 16.67 ms (60 Hz) budget. * perf_counters.h gains `perf_counters_find(name)` so the harness can snapshot a counter before/after each retro_run() call and report per-frame deltas. * test_benchmark uses the new lookup to print the slowest frames along with their blitter call count and inner-iteration count, and the slow-vs-avg ratio. * Makefile gains BLITTER_TRACE=1 toggle that wires -DBLITTER_TRACE; src/tom/blitter.c grows a per-blit elapsed-ms dump that fires for any single BlitterMidsummer2 call exceeding a threshold, so we can tell whether worst-case frame spikes are one giant blit or many small ones. What this revealed for AvP gameplay (state7, accurate blitter, 1200 frames after 120 warmup, M-series host): avg 5.1 ms, p99 17.0 ms, p999 18.2 ms, max 18.4 ms 25 / 1200 frames over 16.67 ms (2%) Avg per frame: 403 blits, 49,774 inner iters Slow per frame: 2009 blits, 247,508 inner iters (5x) The slow frames are perfectly periodic and consistent (5x the work on every spike, every spike same exact count). No single blit is slow -- the BLITTER_TRACE dump prints zero lines even at a 0.3 ms threshold. AvP is just doing 5x more blitter work on roughly every 24th frame. That's the audio dropout root cause: the inlining helped average throughput but the periodic 5x spikes still blow the budget. Co-Authored-By: Claude Opus 4.7 <[email protected]>
Author
Regression:
|
| ROM | Status | Details | Diff |
|---|---|---|---|
| jagniccc | ✅ PASS | 0 pixels differ | - |
| yarc | ✅ PASS | 0 pixels differ | - |
| jagniccc (determinism) | ✅ PASS | identical across runs | - |
| yarc (determinism) | ✅ PASS | identical across runs | - |
| jagniccc (frameskip) | ✅ PASS | skip=0 matches skip=3 | - |
| yarc (frameskip) | ✅ PASS | skip=0 matches skip=3 | - |
| jagniccc (save state) | ✅ PASS | round-trip matches | - |
| yarc (save state) | ✅ PASS | round-trip matches | - |
| jagniccc (rewind) | ✅ PASS | rewind matches | - |
| yarc (rewind) | ✅ PASS | rewind matches | - |
Platform: Darwin arm64
Updated by CI at 2026-05-02T19:51:41.719Z
Author
Regression:
|
| ROM | Status | Details | Diff |
|---|---|---|---|
| jagniccc | ✅ PASS | 0 pixels differ | - |
| yarc | ✅ PASS | 0 pixels differ | - |
| jagniccc (determinism) | ✅ PASS | identical across runs | - |
| yarc (determinism) | ✅ PASS | identical across runs | - |
| jagniccc (frameskip) | ✅ PASS | skip=0 matches skip=3 | - |
| yarc (frameskip) | ✅ PASS | skip=0 matches skip=3 | - |
| jagniccc (save state) | ✅ PASS | round-trip matches | - |
| yarc (save state) | ✅ PASS | round-trip matches | - |
| jagniccc (rewind) | ✅ PASS | rewind matches | - |
| yarc (rewind) | ✅ PASS | rewind matches | - |
Platform: Linux x86_64
Updated by CI at 2026-05-02T19:52:00.719Z
Author
Regression:
|
| ROM | Status | Details | Diff |
|---|---|---|---|
| jagniccc | ✅ PASS | 0 pixels differ | - |
| yarc | ✅ PASS | 0 pixels differ | - |
| jagniccc (determinism) | ✅ PASS | identical across runs | - |
| yarc (determinism) | ✅ PASS | identical across runs | - |
| jagniccc (frameskip) | ✅ PASS | skip=0 matches skip=3 | - |
| yarc (frameskip) | ✅ PASS | skip=0 matches skip=3 | - |
| jagniccc (save state) | ✅ PASS | round-trip matches | - |
| yarc (save state) | ✅ PASS | round-trip matches | - |
| jagniccc (rewind) | ✅ PASS | rewind matches | - |
| yarc (rewind) | ✅ PASS | rewind matches | - |
Platform: Linux aarch64
Updated by CI at 2026-05-02T19:51:49.819Z
MSVC x86 / x64 compilation checks failed on PR #129 because `__attribute__((always_inline))` is GCC/Clang-specific. Wrap it in a BLITTER_ALWAYS_INLINE macro that maps to: GCC / Clang: inline __attribute__((always_inline)) MSVC: __forceinline (replaces inline) Other: inline (best-effort) The macro spells the inline keyword itself so call sites are just `static BLITTER_ALWAYS_INLINE void foo(...)` -- no extra INLINE qualifier (MSVC's __forceinline conflicts with another inline keyword, which is why the original `static INLINE __attribute__(...)` form would have failed there even if MSVC understood the attribute). Verified: clang still inlines (AvP accurate ~196 FPS, same as the attribute-only form); test suite passes; libretro buildbot's MSVC target should now build clean. Co-Authored-By: Claude Opus 4.7 <[email protected]>
Contributor
There was a problem hiding this comment.
Pull request overview
This PR focuses on improving performance of the accurate blitter path (notably AvP gameplay on Apple Silicon) by enabling stronger compiler specialization/inlining in the hottest blitter helpers, and by adding optional profiling infrastructure plus benchmark workflow improvements to better measure gameplay-like performance.
Changes:
- Inline hot blitter helpers (
ADD16SAT,ADDARRAY,COMP_CTRL,DATA) and add optional blitter perf counters/tracing hooks. - Add an opt-in
perf_countersinstrumentation system (enabled viaBENCH_PROFILE=1) and expose it through the test-export ABI. - Enhance
make benchmark/test_benchmarkto support loading a save state and reporting per-frame timing variance; add a macOSxctracewrapper and profiling docs updates.
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
src/tom/blitter.c |
Forces inlining of hot helper routines and adds optional perf counters + tracing instrumentation around the accurate blitter inner loop. |
src/core/perf_counters.h |
Introduces opt-in counter macros and registry API for lightweight instrumentation. |
src/core/perf_counters.c |
Implements the counter registry, lookup, dump, and reset behavior. |
test/tools/test_benchmark.c |
Adds optional perf counter access plus per-frame timing samples/percentiles and worst-frame reporting. |
Makefile |
Adds BENCH_PROFILE/BLITTER_TRACE build flags and enhances benchmark target to support state loading and optional wide exports. |
Makefile.common |
Adds perf_counters.c to the core build sources. |
link-test.T |
Exports perf counter symbols for the test ABI (ELF). |
exports-test.list |
Exports perf counter symbols for the test ABI (Mach-O). |
scripts/profile-mac.sh |
Adds a macOS Instruments (xctrace) wrapper for running the benchmark under templates. |
docs/profiling.md |
Documents the new macOS profiling workflow and the BENCH_PROFILE=1 counters. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Copilot review on PR #129 caught that the per-frame stats path unconditionally allocates / sorts / divides by num_frames, which would OOB-access sorted[num_frames - 1] and divide by zero if a caller passed 0 or a negative number. Both come from atoi() which returns 0 on garbage input, so this was reachable. Validate both at parse time and exit with a clear error before allocating anything. Co-Authored-By: Claude Opus 4.7 <[email protected]>
This was referenced May 2, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
ADD16SAT/ADDARRAY/COMP_CTRL/DATA(static INLINE __attribute__((always_inline))) so the compiler can specialise per call site.perf_counters.hinstrumentation system (zero-overhead whenBENCH_PROFILEis undefined) that drove the diagnosis.make benchmarknow acceptsBENCH_STATE=...to load a save state before timing — lets us measure actual gameplay instead of boot/menu.scripts/profile-mac.shxctrace wrapper anddocs/profiling.mdupdates documenting the workflow.Motivation
Reported in chat: AvP on RetroArch with the accurate blitter has audio dropouts on Apple Silicon. Headless
make benchmarkagainst the AvP gameplay state confirmed accurate is ~2× slower than fast (346 → 174 FPS). xctrace Time Profiler showedADDARRAYas the single largest leaf in the entire emulator (1910 samples), withDATA(759) andCOMP_CTRL(318) following — all called only from theBlitterMidsummer2inner loop, all with mostly-constant flag arguments.Result
Sample-of-stack hits in the blitter chain went from 5268 → 4592.
Test plan
make testpasses (existing audio-clipping warnings are pre-existing EXPECTED-FAIL on AvP)make benchmark3-run consistency on AvP state6 fast + accuratescripts/c89-lint.shcleanNotes
🤖 Generated with Claude Code